Text Classification and Multilinguism: Getting at Words via N-grams of Characters
Authors
Abstract
Genuine numerical multilingual text classification is almost impossible if only words are treated as the privileged unit of information. Although text tokenization (in which words are considered as tokens) is relatively easy in English or French, it is much more difficult for other languages such as German or Arabic. Moreover, stemming, typically used to normalize and reduce the size of the lexicon, constitutes another challenge. The notion of N-grams of words (i.e. sequences of N words, with N typically equal to 2, 3, or 4), which for the last ten years seems to have produced good results both in language identification and speech analysis, has recently become a privileged research axis in several areas of knowledge extraction from text. In this paper, we present a text classification software based on N-grams of characters (not words), evaluate its results on documents containing text written in English and French, and compare these results with those obtained from a different classification tool based exclusively on the processing of words. An interesting feature of our software is that it does not need to perform any language-specific processing and is thus appropriate for multilingual text classification.
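The character n-gram representation the abstract describes can be sketched in a few lines. The helper below is a hypothetical illustration (not the authors' actual software) of why no tokenization, stemming, or other language-specific processing is required:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams in raw text.

    Works directly on the character stream, so no tokenizer or
    stemmer is needed: the same code handles English, French,
    German, or Arabic input unchanged.
    """
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Two tiny "documents" in different languages share many trigrams,
# which is the kind of signal a classifier can exploit.
en = char_ngrams("text classification")
fr = char_ngrams("classification de textes")
shared = set(en) & set(fr)
```

A real classifier would turn these counts into weighted feature vectors and compare them with a distance measure, rather than intersecting raw n-gram profiles as this sketch does.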
Similar Papers
Improving K-Nearest Neighbor Efficacy for Farsi Text Classification
One of the common processes in the field of text mining is text classification. Because of the complex nature of the Farsi language, with words written in separate parts and with combined verbs, most text classification systems are not applicable to Farsi texts. K-Nearest Neighbors (KNN) is one of the most popular methods for text classification and presents good performance in experiments on different d...
Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?
This article offers an empirical study on the different ways of encoding Chinese, Japanese, Korean (CJK) and English languages for text classification. Different encoding levels are studied, including UTF-8 bytes, characters, words, romanized characters and romanized words. For all encoding levels, whenever applicable, we provide comparisons with linear models, fastText (Joulin et al., 2016) an...
A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese
In the process of establishing information theory, C. E. Shannon proposed the Markov process as a good model to characterize a natural language. The core of this idea is to calculate the frequencies of strings composed of n characters (n-grams), but such statistical analysis of large text data for a large n had never been carried out because of the memory limitations of computers and the ...
Experimenting N-Grams in Text Categorization
This paper deals with automatic supervised classification of documents. The approach suggested is based on a vector representation of the documents centred not on the words but on the n-grams of characters for varying n. The effects of this method are examined in several experiments using the multivariate chi-square to reduce the dimensionality, the cosine and Kullback-Leibler distances, and tw...
Syntactic Dependency-Based N-grams as Classification Features
In this paper we introduce the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in which elements are considered neighbors. In the case of sn-grams, neighbors are taken by following syntactic relations in syntactic trees, rather than by taking the words as they appear in the text. Dependency trees fit directly into this idea, while in the case of constituency...
Journal:
Volume Issue
Pages -
Publication date: 2002